5 months ago Kaggle, a website that hosts cash-prize competitions for teams of data scientists, released its annual user survey. This comprehensive survey asked Kaggle members numerous questions in order to collect metrics on its user base. Our group selected this survey as our data set for determining the top hard and soft skills required of a data scientist.
Breakdown:
Credit to Amber Thomas for providing the following code used for extracting and summarizing answers to multiple-choice questions.
library(dplyr)
library(tidyr)

# Summarize a single-choice question: count and percent of each answer.
# (Adjusted to take the data frame as an argument rather than relying on
# a global `exp_df`.)
chooseOne = function(question, df){
  df %>%
    dplyr::filter(!UQ(sym(question)) == "") %>%  # drop blank responses
    dplyr::group_by_(question) %>%
    dplyr::summarise(count = n()) %>%
    dplyr::mutate(percent = (count / sum(count)) * 100) %>%
    dplyr::arrange(desc(count))
}

# Summarize a multiple-choice question whose answers are stored as one
# comma-separated string per respondent.
chooseMultiple = function(question, df){
  df %>%
    dplyr::filter(!UQ(sym(question)) == "") %>%
    dplyr::select(question) %>%
    dplyr::mutate(totalCount = n()) %>%
    # Split on commas, but not commas inside parentheses: the PCRE
    # (*SKIP)(*FAIL) verbs discard matches within a parenthesized phrase.
    dplyr::mutate(selections = strsplit(as.character(UQ(sym(question))),
                                        '\\([^)]+,(*SKIP)(*FAIL)|,\\s*',
                                        perl = TRUE)) %>%
    tidyr::unnest(selections) %>%
    dplyr::group_by(selections) %>%
    dplyr::summarise(totalCount = max(totalCount),
                     count = n()) %>%
    dplyr::mutate(percent = (count / totalCount) * 100) %>%
    dplyr::arrange(desc(count))
}

# Same single-choice summary as chooseOne; retained for the
# academic-background questions.
Academic_exploration = function(question, df){
  df %>%
    dplyr::filter(!UQ(sym(question)) == "") %>%
    dplyr::group_by_(question) %>%
    dplyr::summarise(count = n()) %>%
    dplyr::mutate(percent = (count / sum(count)) * 100) %>%
    dplyr::arrange(desc(count))
}

# Convert a vector of counts into percentages of the total.
proportion_function <- function(vec){
  vec / sum(vec) * 100
}

# Bin a numeric column into labeled intervals (left-closed, since right = FALSE).
create_breaks <- function(dfcolumn, breaks, labels){
  dfcolumn <- as.numeric(dfcolumn)
  cut(dfcolumn, breaks = breaks, labels = labels, right = FALSE)
}
INTRODUCTION HERE
CONCLUSION
In this section, we examine how data scientists gained their skill sets. Studying how successful data scientists learned may offer valuable insight into what makes a strong data scientist.
Our data shows great diversity in learning styles. Not only do data scientists learn from a variety of sources, but the relative importance of those sources varies from person to person. This suggests there is no right or wrong way to learn to become a data scientist. At the same time, since the four major categories account for nearly 100% of education, there appear to be no “secret” learning sources.
It is interesting to note that nearly 75% of data scientists indicated they learned while on the job.
In this section, we explore commonly used algorithms and methods that are presumably required as basic skills in the data science field.
It appears that, on average, data scientists use at least 3 algorithms and 7 methods in their work. The bar graph above shows the most commonly used algorithms and methods.
An average data scientist can apply the algorithms and methods listed above as basic hard skills, meeting the standard industry expectation. An exceptional data scientist may be capable of handling 7 to 30 methods and 4 to 15 algorithms.
Furthermore, the most common dataset size appears to fall in the 1 GB to 10 GB range (over 50% of respondents). For reference, the last graph displays the most-used methods by dataset size.
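Dataset-size bins like the ones above can be produced with the create_breaks() helper defined earlier. A minimal sketch, using hypothetical sizes in GB (not the survey's actual values):

```r
# Hypothetical dataset sizes in GB
sizes_gb <- c(0.5, 2, 8, 50, 120)

# Left-closed bins (right = FALSE): a size of exactly 10 falls in 10-100GB
bins <- create_breaks(sizes_gb,
                      breaks = c(0, 1, 10, 100, Inf),
                      labels = c("<1GB", "1-10GB", "10-100GB", ">100GB"))
table(bins)
```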
In this section, we address the challenges faced by data scientists and how their time is typically spent at work. We believe that the time spent on data science tasks, and the challenges associated with them, will provide useful insight into the skills necessary to succeed as a data scientist.
The data shows that data scientists spend a whopping 34% of their time gathering and cleaning data. Almost 25% of their time is spent selecting and building models, and 27% is spent visualizing, discovering, and communicating insights to stakeholders. This is evidence that data scientists must have superb data cleaning and modeling skills, and must be able to communicate their findings to stakeholders both visually and verbally.
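Percentage breakdowns like this come from the proportion_function() helper defined earlier. A sketch with hypothetical hour tallies (the task names and values are illustrative, not the survey's figures):

```r
# Hypothetical hours per task in one work week
hours <- c(gathering_cleaning = 17, model_building = 12.5,
           visualizing_communicating = 13.5, other = 7)

# Share of total time per task, in percent
round(proportion_function(hours), 1)
```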
Interestingly, dirty data is the most prevalent challenge, at 48%. A staggering 39% of data scientists were challenged by company politics and by a lack of financial or management support. Interpersonal skills are vital in navigating office politics, and technical writing skills may aid in drafting proposals for financial support. 31% of respondents reported challenges with data access and availability, so advanced data acquisition skills are an asset for data scientists.
24% of respondents reported that results from data science projects went unused. This is alarming, given that data science can be very expensive. Honing communication skills may reduce the proportion of unused results.
One in four data scientists lacks a clear question to answer or a direction to take with the data, one in five reported challenges explaining data science to others, and one in seven reported issues maintaining reasonable expectations for data science projects. These all speak to communication skills and to the ingenuity and creativity needed to frame questions and problems in a way that garners proper responses from stakeholders.
It would have been interesting to investigate the relationship between the rate of unused results and the variables related to communication. However, the answers were internally randomized, so results in the same row may not be from the same respondent.
HARD SKILLS
SOFT SKILLS
In conclusion…
The original Kaggle data was in an untidy form. As part of data preparation, we each created tidy data subsets and saved them as a series of CSV files stored on our GitHub. The following SQL script will import them into a series of tables. We hope that this will aid future research and help to find connections that we may have missed.